Web Genre Benchmark Under Construction

Authors

  • Marina Santini
  • Serge Sharoff
Abstract

The project discussed in this article focuses on the creation of web genre benchmarks (a.k.a. web genre reference corpora or web genre test collections), i.e. newly conceived test collections against which it will be possible to judge the performance of future genre-enabled web applications. The creation of web genre benchmarks is of key importance for the next generation of web applications because, at present, it is impossible to evaluate existing and in-progress genre-enabled prototypes. We suggest focusing on the following key points: 1) propose a characterisation of genre suitable for digital environments and empirical approaches shared by a number of genre experts working in automatic genre identification; 2) define the criteria for the construction of web genre benchmarks and draw up annotation guidelines; 3) create several web genre benchmarks in several languages; 4) validate the methodology and evaluate the results.

Similar resources

Genre Classification of Web Pages

Genre classification means to discriminate between documents by means of their form, their style, or their targeted audience. Put another way, genre classification is orthogonal to a classification based on the documents’ contents. While most of the existing investigations of an automated genre classification are based on news articles corpora, the idea here is applied to arbitrary Web pages. W...

Common Criteria for Genre Classification: Annotation and Granularity

In this paper, we present two experiments that use machine learning for automatically classifying web pages by genre. These experiments highlight the influence that genre annotation and genre granularity can have on the accuracy of the classification. From a practical point of view these experiments show that a collection annotated with the criteria of ‘objective sources’ and consistent genre gr...

Leveraging Website Genre and Structure Information for Fake Website Detection

In this study we assessed the efficacy of using website genre composition and design structure information for fake website detection. A genre tree kernel was proposed that creates a rooted tree from the website file directory structure, and labels the tree’s file nodes with genre information. The genre tree kernel was compared against several benchmark kernel and non-kernel methods that utiliz...

Refined and Incremental Centroid-based approach for Genre Categorization of Web pages

In this paper, I propose a refined and incremental centroid-based approach for genre categorization of web pages. My approach is based on the construction of genre centroids using a set of training web pages. These centroids will be used to classify new web pages. The originality of my approach is the implementation of two new aspects, which are refining and incrementing. My approach is based o...

Web Genre Analysis: Use Cases, Retrieval Models, and Implementation Issues

People who search the World Wide Web often have a multifaceted understanding of their information need: they know what they are searching for, and they know of which form or type the desired documents should be. The former aspect relates to the content of a desired document (= topic), the latter to the presentation of its content and the intended target group. Due to the different user groups a...

Journal title:
  • JLCL

Volume 24  Issue 

Pages  -

Publication date 2009